This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
1- Which genres are most popular from year to year?
2- Which year had the most movie releases?
3- Which movies are the most popular of all time?
4- Which movie had the highest budget?
5- Is a bigger production budget associated with higher popularity?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load your data and print out a few lines. Perform operations to inspect data
# types and look for instances of missing or possibly errant data.
df = pd.read_csv('tmdb-movies.csv')
df.head(10)
df.shape
df.describe()
df.info()
# Histograms of the numeric columns before cleaning
df.hist(figsize=(15,10));
df.duplicated().sum()
# Drop duplicate rows
df.drop_duplicates(inplace=True)
# Confirm that the duplicates were removed
df.duplicated().sum()
df.shape
df.isnull().sum()
df[df.cast.isnull()]
df['tagline'].value_counts()
df.dropna(inplace=True)
df.isnull().sum()
df.shape
df.hist(figsize=(15,10));
df.info()
df.describe()
We will look at the genre values and count how often each one occurs, so we can find the most popular genres.
genres = df['genres'].value_counts()
genres
genres.plot(kind='barh',figsize=(7,100))
plt.ylabel('genres')
plt.xlabel('number of movies')
plt.title('popularity of genres')
def separate_count(column):
    # Join every row of the column with '|', split into single values, then count them
    split_data = pd.Series(df[column].str.cat(sep = '|').split('|'))
    count_data = split_data.value_counts(ascending=False)
    return count_data
# Plot pie relationship between genre and number of movies
separate_count("genres").plot(kind="pie",figsize=(9,9),autopct="%1.1f%%")
# the title of the plot
plt.title('Percentage Of Genres')
plt.ylabel('');
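The counts above cover all years together, while the first question asks about genres from year to year. A minimal sketch of one way to break the result down by year is shown here. It assumes "popular" means the highest mean `popularity` score (an assumption, since the notebook counts movies instead), and it uses a small invented `toy` frame rather than `tmdb-movies.csv`. The names `toy` and `top_genre_per_year` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows that mimic the 'release_year', 'genres' and
# 'popularity' columns of the data set; the values are invented.
toy = pd.DataFrame({
    "release_year": [2014, 2014, 2014, 2015, 2015, 2015],
    "genres": ["Action|Adventure", "Action", "Drama",
               "Drama|Romance", "Drama", "Action"],
    "popularity": [5.0, 7.0, 1.0, 4.0, 6.0, 3.0],
})

def top_genre_per_year(frame):
    """Return, for each year, the genre with the highest mean popularity."""
    # Split 'Action|Adventure' into a list, then give each genre its own row.
    exploded = frame.assign(genres=frame["genres"].str.split("|")).explode("genres")
    # Mean popularity of every (year, genre) pair.
    mean_pop = exploded.groupby(
        ["release_year", "genres"], as_index=False
    )["popularity"].mean()
    # Sort so the best genre comes first, then keep one row per year.
    best = mean_pop.sort_values("popularity", ascending=False)
    best = best.drop_duplicates("release_year").sort_values("release_year")
    return best.set_index("release_year")["genres"]

top_by_year = top_genre_per_year(toy)
print(top_by_year)
```

Passing `df` instead of `toy` would apply the same steps to the cleaned data. Replacing `.mean()` with `.size()` on the grouped frame would rank genres by movie count instead, which is closer to what `separate_count` measures.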
The year with the most releases is 2011.
release = df['release_year'].value_counts()
release
release.plot(kind='bar',figsize=(15,5))
plt.xlabel('years')
plt.ylabel('number of releases')
plt.title('Number of movies in each year')
The most popular movie is the one with the maximum popularity score.
df['popularity'].max()
# Select by the computed maximum rather than a hard-coded float value
df[df['popularity']==df['popularity'].max()]
We will work with two variables, 'budget' and 'original_title', and find the movie with the maximum budget.
df[df['budget']==df['budget'].max()]
df['budget'].describe()
plt.pie(df['budget'],labels=df['original_title']);
plt.title('budget')
plt.legend(df['original_title'])
plt.show()
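A pie chart with one slice per movie is hard to read when the data set has thousands of rows. As a possible alternative, the sketch below keeps only the few highest budgets. It runs on an invented `toy` frame (the titles and amounts are hypothetical), and `top_budgets` is an illustrative name, not something from the notebook.

```python
import pandas as pd

# Hypothetical toy rows; titles and budgets are invented for illustration.
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "budget": [150_000_000, 0, 90_000_000, 200_000_000],
})

# Keep the 3 rows with the largest budget, highest first,
# and index the result by title so it is easy to read or plot.
top_budgets = toy.nlargest(3, "budget").set_index("original_title")["budget"]
print(top_budgets)
```

With the real data, `df.nlargest(10, 'budget')` would give the ten most expensive movies, and calling `.plot(kind='barh')` on the resulting series should draw them as a horizontal bar chart.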
Next we will see how popularity varies with budget,
so we will work with the popularity and budget columns.
sns.regplot(x=df['budget'],y=df['popularity']).set_title('relation between budget and popularity')
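The regression plot shows the trend visually. A correlation coefficient can put a number on it. The sketch below computes the Pearson correlation on an invented `toy` frame. Dropping rows with a budget of 0 is an assumption made here (that 0 means "not recorded"), not a step the notebook performs, and the names `toy`, `known` and `r` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration.
toy = pd.DataFrame({
    "budget": [0, 10_000_000, 50_000_000, 100_000_000, 200_000_000],
    "popularity": [0.5, 0.8, 1.5, 3.0, 6.0],
})

# Assumption: a budget of 0 means the value is unknown, so skip those rows.
known = toy[toy["budget"] > 0]

# Pearson correlation between budget and popularity (between -1 and 1).
r = known["budget"].corr(known["popularity"])
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship. It describes association only and does not show that a larger budget causes higher popularity.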
Tip:
1- There are many NaN values in the cast column
2- There are many zeros in budget and revenue
3- There are many NaN values in tagline
4- There are duplicated rows
5- Some columns have incorrect data types
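The zeros in budget and revenue can be counted and, if they are treated as missing, replaced with NaN so they no longer pull down means and plots. This is a minimal sketch on an invented `toy` frame. Treating 0 as "unknown" is an assumption, and `money_cols`, `zero_counts` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the amounts are invented for illustration.
toy = pd.DataFrame({
    "budget": [0, 25_000_000, 0, 150_000_000],
    "revenue": [0, 60_000_000, 1_200_000, 0],
})
money_cols = ["budget", "revenue"]

# Count the zero entries in each column.
zero_counts = (toy[money_cols] == 0).sum()
print(zero_counts)

# Assumption: zero means "unknown", so mark it as missing on a copy.
cleaned = toy.copy()
cleaned[money_cols] = cleaned[money_cols].replace(0, np.nan)
print(cleaned.isnull().sum())
```

After the replacement, pandas statistics such as `mean()` skip the missing entries by default, so they are based only on the movies with a recorded value.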
Tip: Drama is the most popular genre from year to year
The largest number of movies was released in 2011
Jurassic World has the highest popularity
The Warrior's Way has the highest budget
There is a positive relationship between budget and popularity, with a few exceptions